DETERMINING STRUCTURAL SIMILARITY IN 
SEMI-STRUCTURED DOCUMENTS 

Field of the invention 

The present invention relates to determining structural similarity in semi-structured 
documents. 

Background 

Several methods exist that model documents as labeled trees. These methods are based on 
the fact that any semi-structured document that uses a markup language can be 
represented as a tree such as a Document Object Model (DOM) tree. The labels of the 
nodes correspond to the tags in the markup language. These methods define the structural 
dissimilarity between a pair of documents as the edit distance between the corresponding
labeled trees. This is the tree model for the representation of the structural information. 

The basic idea behind all tree edit distance algorithms is to find the cheapest sequence of 
edit operations that will transform one tree into another. Some of these methods model 
documents as ordered labeled trees, while others model them as unordered labeled trees.
In general, finding the edit distance between unordered labeled trees is computationally 
more complex than finding the edit distance between ordered labeled trees. A key 
differentiator among the various tree distance algorithms is the set of edit operations 
allowed. 

Some work in this area used insertion and deletion of leaf nodes and relabelling of a node
anywhere in the tree. Several other approaches with different sets of edit operations have
been proposed. These tree edit distance measures have been modified to address issues such
as repetitive and optional fields.

For instance, Nierman et al [Andrew Nierman, H. V. Jagadish, "Evaluating Structural
Similarity in XML Documents", Proceedings of the Fifth International Workshop on the
Web and Databases (WebDB 2002), June 2002] propose a dynamic programming
algorithm that computes the distance between any pair of documents taking into account
Extensible Markup Language (XML) issues such as optional and repeated sub-elements.
Nierman et al further give a method to cluster documents based on this distance measure.
The algorithm to compute the tree edit distance for a pair of documents is of quadratic
complexity in the combined size of the two documents.

Cruz et al [Isabel F. Cruz, Slava Borisov, Michael A. Marks and Timothy R.
Webb, "Measuring Structural Similarity Among Web Documents: Preliminary Results",
Lecture Notes in Computer Science, volume 1375, page 513, 1998] propose an alternative
approach to modeling structure based on tag frequency measures. This approach can be
viewed as the node model for the representation of the structural information, since this
approach only uses information about the tags of the various nodes in the corresponding
tree model.

The method of Cruz et al relies on the assumption that tag frequencies reflect some
inherent characteristics of Web documents and correlate with their structure. While the node
model is very simple, the model does not take into account the order in which tags appear.
Therefore, if the tags of all nodes are rearranged, the representation does not change.
Thus, the model is adequate only when the templates are drastically different from each
other, that is, they have very few tags in common. This is rarely the case in practice.

In view of the above comments, a need clearly exists for an improved manner of 
comparing documents for determining the structural similarity of the documents. 

Summary

Techniques are presented herein for measuring the similarity between two pages based on
their structural syntax. Structurally similar pages may differ in their textual and numeric
contents. Documents, as well as document collections, are commonly represented as vectors
of feature values. These features are based on the words and phrases occurring in the
document collection. Therefore, this representation of a document describes text (or
possibly semantic) content of documents, and the similarity values describe a type of text
or content similarity between the documents.

Several techniques exist to measure similarity between two numeric vectors. Such
techniques are used to measure the similarity between two documents, and between a
document and a document collection.

For measuring the structural similarity between documents, documents are represented 
based on their structure. The structure arises from the various elements on the document 
and the nature of their nesting. After representing documents based on their structure in 
vector form, an existing method of measuring similarity between vectors is used to obtain 
the measure of structural similarity between two given documents. 

Description of drawings 

Fig. 1 is a schematic representation of an XML document and a corresponding labeled 
tree. 

Fig. 2 is a schematic representation of three respective Document Object Model (DOM) 
trees that are represented for purpose of comparison. 

Fig. 3 is a schematic representation of an example DOM tree labeled with positional 
indices, and is represented for the purpose of discussion. 

Fig. 4 is a flowchart of steps in a procedure for comparing a pair of documents. 

Fig. 5 is a schematic representation of a computer system suitable for performing the 
techniques described with reference to Figs. 1 to 4. 

Detailed description 

Two documents are typically assessed to be structurally similar if they have a similar 
"look and feel", or layout. By way of example, structurally similar pages might be 
generated in the following ways:

• Pages generated by providing values to a predetermined master shell page.

• Pages dynamically generated on servers using server code.

• Pages generated in accordance with a template.

The techniques now described define a procedure for conducting a comparison of
documents to reach a quantitative determination of the degree of structural similarity
between documents.



Document Object Model 

A document is modelled as a labeled tree in the Document Object Model (DOM). In the
labeled tree model of a document, each node in the tree corresponds to an element of the
markup language in the document. The tag name of an element acts as a label for the
node. The inclusion of a tag inside the scope of another tag is captured by a "parent-child"
relationship in the labeled tree. Text nodes are excluded from the labeled tree, as text
nodes are immaterial to the structural properties of the document. The tree representation
of a document is sometimes known as the DOM (Document Object Model) tree.



Fig. 1 depicts an XML document 110 and its corresponding DOM tree 120. Given a
semi-structured document, one can generate its DOM tree manually, or using any suitable
parser programme that analyses XML or related documents. Examples of suitable parsers
are Xerces, XML4J, and JTidy, though any suitable parser can of course be used. These
and other suitable packages are generally available via the Web. The JTidy package not
only corrects common HTML errors, but also "XMLizes" the document and provides the
corresponding DOM tree for the document.

In Fig. 1, all other tags appearing in the document are contained within the scope of tag
<a> (that is, the document tags appear between <a> and </a>) in the XML document 110,
and therefore the root node in the corresponding DOM tree is labeled as a in 120. Tags <c>
and <d> appear within the scope of tag <b> as directly nested tags and therefore nodes 3
and 4, with the labels c and d respectively, are the children of node 2 in the tree 120. Two
<e> tags and an <f> tag are directly nested inside the <d> tag in 110, and the corresponding
nodes are therefore the children of node 4 and are labeled accordingly in 120.



Bag of Tree Paths Model 

In the bag of tree paths model, a document is represented by a set of sequences of labels
that occur in the paths from the root node to the leaf nodes of the corresponding tree
representation (in this case, a DOM tree). The path from the root node to any leaf node
contains the root node, the leaf node, and all the intermediate nodes required to reach the
leaf node in sequence. Each such path contributes a sequence of labels to the model.



As an example, leaf node 3 of the DOM tree 120 given in Fig. 1 contributes the sequence
a/b/c. Similarly, node 5 contributes the sequence a/b/d/e. The same sequence of node
labels can occur in two or more distinct paths. For example, nodes 5 and 6 in Fig. 1
contribute the same sequence of node labels. A sequence of node labels is referred to as a
path.
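
The path extraction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the described implementation: it uses the standard-library xml.etree.ElementTree parser, the helper name tree_paths is hypothetical, and the sample document is only modelled on the description of Fig. 1 (the exact content of document 110 is assumed).

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tree_paths(xml_text):
    """Return a Counter mapping each root-to-leaf label sequence to its frequency.

    Only element nodes are visited; ElementTree keeps text as an attribute of
    an element, so text nodes never appear as tree nodes here.
    """
    root = ET.fromstring(xml_text)
    paths = Counter()

    def walk(node, prefix):
        labels = prefix + [node.tag]
        children = list(node)
        if not children:
            # Leaf element: its label sequence is one path of the document.
            paths["/".join(labels)] += 1
        for child in children:
            walk(child, labels)

    walk(root, [])
    return paths


# Hypothetical document modelled on the description of Fig. 1.
doc = "<a><b><c/><d><e/><e/><f/></d></b></a>"
bag = tree_paths(doc)
for path, count in sorted(bag.items()):
    print(path, count)
```

The two e leaves yield the same label sequence, so the path a/b/d/e is counted twice while a/b/c and a/b/d/f are counted once each.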

Let $D = \{d_1, d_2, \ldots, d_n\}$ be the collection of n trees corresponding to n
documents. Let $m_i$ be the number of leaf nodes present in tree $d_i$. There are thus $m_i$
paths in tree $d_i$. A dictionary of distinct paths $Dict_{paths}$ can be constructed by
collating such paths from all trees $d_i$, $1 \le i \le n$. Not all paths, however, are
equally important for describing the structure of documents in the bag of tree paths model.
Thus, feature selection techniques are used to remove non-informative paths, to simplify the
model. Paths that occur in very few documents, and paths that occur in almost all documents,
are desirably eliminated from the dictionary. In general, however, any feature selection
method that is deemed suitable can be used.



Let the dictionary of paths after feature selection be
$Dict_{paths} = \{p_1, p_2, \ldots, p_N\}$. Let $f_j(p_i)$ denote the frequency of
occurrence of path $p_i$ in document $d_j$, and let $f_{max} = \max_{i,j} f_j(p_i)$.
Now a document $d_j$ can be represented as an N-dimensional vector
$[d_{j1}, d_{j2}, \ldots, d_{jN}]$, where $d_{jk} = f_j(p_k) / f_{max}$, $1 \le k \le N$.

The bag of tree paths model for a document captures only some of the structural
relationships present in the tree. More precisely, the bag of tree paths model incorporates
all the parent/child relationships, but ignores sibling relationships present in the tree
structure.
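
As a hedged sketch of the dictionary construction and vector representation described above, the following Python fragment builds a path dictionary with a simple document-frequency filter and then normalizes path frequencies by the largest frequency in the collection. The function names and the min_docs and max_docs thresholds are hypothetical; the text leaves the choice of feature selection method open.

```python
from collections import Counter


def build_dictionary(bags, min_docs=1, max_docs=None):
    """Collect distinct paths, keeping those whose document frequency lies
    in [min_docs, max_docs]. The thresholds are illustrative assumptions."""
    if max_docs is None:
        max_docs = len(bags)
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(set(bag))  # count each path once per document
    return sorted(p for p, n in doc_freq.items() if min_docs <= n <= max_docs)


def vectorize(bags, dictionary):
    """Represent each document as [f_j(p_k) / f_max for each p_k in dictionary]."""
    f_max = max((bag.get(p, 0) for bag in bags for p in dictionary), default=0)
    if f_max == 0:
        return [[0.0] * len(dictionary) for _ in bags]
    return [[bag.get(p, 0) / f_max for p in dictionary] for bag in bags]


# Hypothetical path-frequency bags for two documents.
bags = [
    Counter({"a/b/c": 1, "a/b/d/e": 2}),
    Counter({"a/b/c": 1, "a/b/d/e": 4, "a/x": 1}),
]
dictionary = build_dictionary(bags)
vectors = vectorize(bags, dictionary)
print(dictionary)
print(vectors)
```

With the default thresholds no path is removed; tightening min_docs or lowering max_docs drops paths that occur in very few or in almost all documents.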

The similarity between a pair of documents $(d_i, d_j)$ is defined as expressed in Equation
[1] below.

$$sim(d_i, d_j) = \frac{\sum_{k=1}^{N} \min(d_{ik}, d_{jk})}{\sum_{k=1}^{N} \max(d_{ik}, d_{jk})} \qquad [1]$$

The numerator of the right hand side of Equation [1] is the sum (over all paths k in the
dictionary of paths) of the minimum of the two frequencies of occurrence of a path k in
the two documents $d_i$ and $d_j$. This is a measure of how much the two documents have in
common in terms of the various paths that appear in the dictionary. The denominator of
Equation [1] is the sum of the maximum of the frequencies of occurrence over all paths
k, and serves as a normalization factor.

Alternatively, the frequencies $f_j(p_k)$ can be ignored, and only the occurrence or
non-occurrence of each path used. In that case, $d_j$ is a binary vector. An important
aspect of the bag of tree paths model is that the model can take into account markup
language issues, such as repetition of elements, in the similarity measure. Ideally, the
similarity value between a pair of documents should be high if the documents differ only in
the number of times a particular markup language subelement occurs under an element.

Fig. 2 schematically represents three DOM trees 210, 220, 230. A meaningful similarity
measure should yield a higher value of similarity for the pair of trees 210 and 230 than for
the pair of trees 210 and 220. All the paths that appear in tree 210 also appear in tree 230.
These trees differ only in the frequency of the path a/b/e. By contrast, the tree 210 has
two paths that do not appear in tree 220 at all. The proposed similarity measure
determines two documents that differ only in the frequency of paths to be more
similar than two documents that differ in the occurrence of paths.
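
The min/max ratio of Equation [1] can be sketched in Python as shown below. The function name similarity, the handling of two all-zero vectors, and the three frequency vectors are assumptions made for illustration; the vectors are not taken from Fig. 2.

```python
def similarity(u, v):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    if len(u) != len(v):
        raise ValueError("vectors must share the same path dictionary")
    denominator = sum(max(a, b) for a, b in zip(u, v))
    if denominator == 0:
        # Assumption: two documents with no dictionary paths score 0.
        return 0.0
    return sum(min(a, b) for a, b in zip(u, v)) / denominator


# Hypothetical path-frequency vectors over a four-path dictionary.
x = [1, 1, 1, 0]
y = [1, 1, 3, 0]  # same paths as x, one path repeated more often
z = [1, 0, 0, 1]  # lacks two paths of x and has one path x lacks

print(similarity(x, y))
print(similarity(x, z))
```

Here the pair that differs only in the frequency of one path scores 3/5, while the pair that differs in which paths occur scores 1/4, which is the ordering the text asks of a meaningful measure.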

The parent-child relationship is sufficient for measuring the structural similarity of 
documents, if the documents contain many levels of nesting (that is, the depth of the 
corresponding trees is relatively high). If, however, the documents do not contain many 
levels of nesting (that is, the corresponding trees are shallow), then the documents will 
more probably contain the same paths, even if the overall structure differs substantially. 
For this reason, a preferred model, referred to as the bag of XPaths model, is described.

Bag of XPaths Model

The bag of XPaths model captures some sibling information in addition to all the
parent/child relationships captured in the bag of tree paths model described above.

In the bag of XPaths model, a representation of a document as a labeled tree includes a
positional index along with the label for each node. The positional index of a node n
reflects the number of preceding sibling nodes with the same label as that of node n: the
first such sibling has index 1, the second has index 2, and so on.

Fig. 3 shows an example of a labeled tree 310. Each node in the tree contains a label as 
well as a positional index. The term index is used as an abbreviation for positional index. 
An XPath is defined in some contexts to be a path expression that locates nodes in a 
DOM tree. The term is used herein, however, to mean any mechanism to describe a path 
in a tree representation of a document that includes, with the label of a node, the 
information about the number of siblings of the nodes that have the same label as the 
node. Consequently, an XPath, as used herein, is essentially a form of path representation, 
in which particular path-specific information is included, in varying degrees of 
specificity. 

The term XPath is defined herein as a sequence of terms separated by the character '/'.
The syntax for each term is as follows: term = nodetest[predicate]. In the existing
technical literature, however, the term XPath is used more generically to denote path
expressions that locate nodes in a DOM tree. There are several ways to express the
location of a set of nodes in a DOM tree using XPaths. As noted above, though, the term
XPath is used herein in a more precise and restricted sense. An XPath is considered
herein as a path in a tree representation of a document that has the capability to include
the positional information of the preceding siblings that have the same label as a given
node in the tree.

616430US

2003.07.25

[I:\ELEC\1BM\JP920030083US 1 ljp920030083usl .textriiial.doc:dinp

The term nodetest is a label that defines a set of nodes (which are referred to as a
node-set), in which each node in the set is a child node of the current node that has
nodetest as its label. The label table is an example of nodetest. A predicate filters the
node-set specified by the nodetest further into a smaller node-set and is always placed
inside a pair of square brackets. [position()=index] and [position()<7] are examples of
predicates.

Predicates of the form [position()=index] are abbreviated as [index]. A predicate is called
an equality predicate if the predicate is of the form [position()=index]. Other predicates
are termed generalized predicates. An XPath is called an equality XPath if all the terms in
the XPath contain only equality predicates. If some of the terms in an XPath contain
generalized predicates, the XPath is called a generalized XPath. For example, in Fig. 3,
the XPath for node 3 is /a[1]/b[1]/c[1]. For node 6, the XPath is /a[1]/b[1]/d[1]/e[2].

Both of these XPaths are equality XPaths. The XPath
/a[1]/b[1]/d[1]/e[position() < 3] is an example of a generalized XPath, which evaluates
to the node-set containing node 5 and node 6. Note that the last term of the XPath
contains a generalized predicate. If any of the terms in an XPath contains a wildcard
(that is, *) or appears without a predicate, then the XPath is still considered to be a
generalized XPath. A nodetest of a term is referenced by term.nodetest. Similarly, the
predicate can be referenced by term.predicate. An index of an equality term can be
referenced by term.index.

An XPath not only captures the parent/child relationships between nodes but also
incorporates some sibling information. For example, in an equality XPath, a positional
index that occurs in a term indicates how many preceding siblings have the same nodetest
(that is, have the same node label) in the DOM tree. An XPath, however, does not capture
sibling information of nodes whose node labels are different from that of the given node.
For example, the XPath for node 6 in Fig. 3 is /a[1]/b[1]/d[1]/e[2]. The positional index 2
for the node label e shows that this node has another sibling node with the label e. There
is, however, no reference to the sibling node f.

A document d can be defined by the set of all equality XPaths corresponding to the leaf
nodes of the DOM tree for d. Note that each equality XPath occurs only once in a
document.
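
A minimal Python sketch of deriving the equality XPaths of the leaf nodes is given below. It assumes the standard-library xml.etree.ElementTree parser, a hypothetical helper name equality_xpaths, and a sample document that is only modelled on the tree discussed with reference to Fig. 3.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def equality_xpaths(xml_text):
    """Return the equality XPath of every leaf element, in document order."""
    root = ET.fromstring(xml_text)
    result = []

    def walk(node, prefix, index):
        path = f"{prefix}/{node.tag}[{index}]"
        children = list(node)
        if not children:
            result.append(path)
        seen = Counter()  # same-label siblings encountered so far
        for child in children:
            seen[child.tag] += 1
            # Positional index = preceding same-label siblings + 1.
            walk(child, path, seen[child.tag])

    walk(root, "", 1)
    return result


# Hypothetical document modelled on the tree discussed for Fig. 3.
doc = "<a><b><c/><d><e/><e/><f/></d></b></a>"
xpaths = equality_xpaths(doc)
for xp in xpaths:
    print(xp)
```

Because the positional index distinguishes repeated siblings, the two e leaves now produce different XPaths, so every equality XPath in the result is unique.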

A dictionary $Dict_{xpaths}$ can be constructed in a fashion similar to the one described
in the bag of tree paths model, based on the equality XPaths for all the leaf nodes of the
DOM trees in the document collection. Taking account of issues such as repetitions is not
straightforward in this model. Elements that appear as a result of repetition differ in their
positional indices, and thus have different XPaths. This is unlike the situation in the bag
of tree paths model, in which all such paths are identical. Therefore, if the similarity
measure defined for the bag of tree paths model is directly used in the bag of XPaths
model, two documents that differ only in the number of repeating elements will have a
relatively low similarity value.

If the number of repeating elements is different, then there are some XPaths that differ
only in the positional index. An element gets its positional index based on how many
preceding sibling nodes have the same label. For a case in which the number of repetitive
elements is different in the two documents, there will be a different number of preceding
siblings with the same label. Accordingly, an alternative similarity measure, which
incorporates the issue of repetition for the bag of XPaths model, is now defined.

Similarity measure 

A special type of generalized predicate is referred to as a repetitive predicate. A
repetitive predicate is of the form [(position()-init) mod diff = 0]. [(position()-1) mod 5 = 0]
is an example of a repetitive predicate, in which the function position() returns the
positional index of the node. The index values 1, 6, 11, etc. satisfy this repetitive
predicate. In this example, the value of diff is 5 and the value of init is 1. For instance,
position 6 satisfies the predicate because (6-1) mod 5 = 0.

A generalized XPath that contains only equality and repetitive predicates, including at
least one repetitive predicate as defined above, is called a repetitive XPath.

Table 1 below presents pseudocode that defines a Boolean function called
subsume($X_1$, $X_2$), in which $X_1$ and $X_2$ are XPaths (either equality or repetitive).
The function returns "true" if the set of nodes evaluated by the XPath $X_2$ is a subset of
the nodes evaluated by $X_1$ for the same tree. When the function returns "true", the XPath
$X_1$ is said to subsume the XPath $X_2$.

As an example, consider two XPaths $X_1$ = /tag1[1]/tag2[(position()-1) mod 5 = 0] and
$X_2$ = /tag1[1]/tag2[1]. All nodes evaluated by $X_2$ are also evaluated by $X_1$, and
hence one concludes that the XPath $X_1$ subsumes the XPath $X_2$. The function subsume
given in Table 1 below does not evaluate the given XPaths on a tree, but determines
subsumption by comparing the XPaths themselves. In this algorithm, the function depth
returns the number of terms present in the given XPath. The function evaluate(p, i) returns
"true" if the index i satisfies the predicate p. Here, $term_i^j$ represents the j-th term
in the XPath $X_i$. The algorithm compares the given XPaths term by term and returns true
if, for every term, the predicates either match exactly or the index of the second XPath
satisfies the predicate of the first XPath.



TABLE 1

boolean subsume(X_1, X_2) {
    if (depth(X_1) ≠ depth(X_2))
        return false
    flag = true
    for every term j of XPath X_1 do
        if (term_1^j.nodetest ≠ term_2^j.nodetest)
            return false
        if (term_1^j.predicate = term_2^j.predicate)
            continue
        else
            let p be term_1^j.predicate
            if (p is an equality predicate)
                return false
            else
                flag = flag ∧ evaluate(p, term_2^j.index)
    return flag
}
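
The term-by-term comparison of Table 1 can be sketched in runnable Python as follows. The class names Equality, Repetitive and Term are hypothetical, as is the choice to represent an XPath as a list of terms. Table 1 does not say what happens when the second XPath has a different repetitive predicate at the same term, so the sketch assumes that case is not subsumed.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Equality:
    index: int  # predicate [index]


@dataclass(frozen=True)
class Repetitive:
    init: int  # predicate [(position() - init) mod diff = 0]
    diff: int


@dataclass(frozen=True)
class Term:
    nodetest: str
    predicate: Union[Equality, Repetitive]


def evaluate(p, i):
    """Return True if positional index i satisfies predicate p."""
    if isinstance(p, Equality):
        return p.index == i
    return (i - p.init) % p.diff == 0


def subsume(x1, x2):
    """Return True if x1 subsumes x2, comparing the XPaths term by term."""
    if len(x1) != len(x2):
        return False
    flag = True
    for t1, t2 in zip(x1, x2):
        if t1.nodetest != t2.nodetest:
            return False
        if t1.predicate == t2.predicate:
            continue
        if isinstance(t1.predicate, Equality):
            return False
        if not isinstance(t2.predicate, Equality):
            return False  # assumption: case not covered by Table 1
        flag = flag and evaluate(t1.predicate, t2.predicate.index)
    return flag


X1 = [Term("tag1", Equality(1)), Term("tag2", Repetitive(init=1, diff=5))]
X2 = [Term("tag1", Equality(1)), Term("tag2", Equality(1))]
X3 = [Term("tag1", Equality(1)), Term("tag2", Equality(3))]
print(subsume(X1, X2), subsume(X2, X1), subsume(X1, X3))
```

X1 and X2 mirror the example in the text: the repetitive XPath subsumes /tag1[1]/tag2[1], but not the reverse, and it does not subsume an XPath whose index 3 fails the predicate.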



Table 2 below presents pseudocode that defines a function called generalize($X_1$, $X_2$).
This code either returns a generalized (repetitive) XPath that subsumes both XPaths $X_1$
and $X_2$, or returns "null". As an example, consider two XPaths $X_1$ = a[1]/b[1]/c[1]/d[1]
and $X_2$ = a[1]/b[1]/c[6]/d[1]. The function generalize($X_1$, $X_2$) returns the
generalized XPath $X_g$ = a[1]/b[1]/c[(position()-1) mod 5 = 0]/d[1]. The term
c[(position()-1) mod 5 = 0] will evaluate to c[1], c[6], c[11] and so on.


TABLE 2

XPath generalize(X_1, X_2) {
    if (depth(X_1) ≠ depth(X_2))
        return null
    gxpath = ""
    for every term j of XPath X_1 do
        if (term_1^j.nodetest ≠ term_2^j.nodetest)
            return null
        if (term_1^j.index = term_2^j.index)
            gxpath = gxpath + "/" + term_1^j
            continue
        else
            diff = |term_1^j.index − term_2^j.index|
            init = term_1^j.index mod diff
            gxpath = gxpath + "/" + term_1^j.nodetest
                     + "[(position() − init) mod diff = 0]"
    return gxpath
}
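
A hedged Python rendering of Table 2 is sketched below. It assumes that each equality XPath is given as a list of (label, index) pairs and that the generalized XPath is returned as a string with a leading '/'; both representation choices are illustrative rather than prescribed by the text.

```python
def generalize(x1, x2):
    """Return a repetitive XPath string covering both equality XPaths,
    or None if they differ in depth or in any node label."""
    if len(x1) != len(x2):
        return None
    parts = []
    for (label1, i1), (label2, i2) in zip(x1, x2):
        if label1 != label2:
            return None
        if i1 == i2:
            parts.append(f"{label1}[{i1}]")
        else:
            diff = abs(i1 - i2)
            init = i1 % diff
            parts.append(f"{label1}[(position() - {init}) mod {diff} = 0]")
    return "/" + "/".join(parts)


X1 = [("a", 1), ("b", 1), ("c", 1), ("d", 1)]
X2 = [("a", 1), ("b", 1), ("c", 6), ("d", 1)]
print(generalize(X1, X2))
```

For the example in the text, indices 1 and 6 at term c give diff = 5 and init = 1 mod 5 = 1, so the generalized term is c[(position() - 1) mod 5 = 0].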



A document $d_i$ can be represented as an N-bit binary vector, where N is the number of
entries in $Dict_{xpaths}$. Let $D_{ei}$ denote the set of all equality XPaths that are
present in document $d_i$.

Note that $D_{ei} \subseteq Dict_{xpaths}$. A set of generalized XPaths is generated, based
on pairs of equality XPaths in $D_{ei}$, using the algorithm defined in Table 2. Let
$D_{gi}$ denote the set of all generalized XPaths that are obtained using
generalize($X_1$, $X_2$). The algorithm attempts to generalize two equality XPaths only if
both of them have the same tree path, that is, the two equality XPaths without the
positional indices must be identical. A tree-path-to-XPath index is created so that for any
given tree path, one can quickly obtain the set of equality XPaths that have the same tree
path.
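
The tree path to XPath index mentioned above could, for instance, be kept as a dictionary keyed by tree path. The sketch below assumes equality XPaths written as strings with abbreviated [index] predicates; the helper names tree_path and build_tree_path_index are hypothetical.

```python
import re
from collections import defaultdict


def tree_path(xpath):
    """Strip the positional indices from an equality XPath string."""
    return re.sub(r"\[\d+\]", "", xpath)


def build_tree_path_index(equality_xpaths):
    """Map each tree path to the list of equality XPaths that share it."""
    index = defaultdict(list)
    for xp in equality_xpaths:
        index[tree_path(xp)].append(xp)
    return dict(index)


# Hypothetical equality XPaths of one document.
xpaths = [
    "/a[1]/b[1]/c[1]",
    "/a[1]/b[1]/d[1]/e[1]",
    "/a[1]/b[1]/d[1]/e[2]",
    "/a[1]/b[1]/d[1]/f[1]",
]
index = build_tree_path_index(xpaths)
print(index["/a/b/d/e"])
```

Only XPaths that fall under the same key are candidates for generalization, so the lookup avoids comparing every pair of XPaths in the document.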

Let X denote one such set of equality XPaths, and let T be the number of terms in any
XPath in X. All the XPaths in X have the same number of terms. A pair of XPaths $X_1$ and
$X_2$ is chosen in X such that there exists some t, $1 \le t \le T$, for which
$term_1^t.index \ne term_2^t.index$. Here $term_1^t.index$ and $term_2^t.index$ are the
lowest two indices for $term^t.nodetest$, and for all $k \ne t$,
$term_1^k.index = term_2^k.index$.

That is, two XPaths are generalized if and only if they differ in the positional index at
exactly one term t. Further, the two indices should be the lowest two indices that occur for
the label associated with the term t in the set X. Therefore, the number of generalized
XPaths that one can derive from a tree path is bounded by T.

The complete set of XPaths for document $d_i$ is $D_{ei} \cup D_{gi}$. Now, the similarity
measure for a pair of documents $d_i$ and $d_j$ can be defined as expressed in Equation [2]
below.

In Equation [2] above, e is the number of XPaths that are common to both $d_i$ and $d_j$.
The term s is the total number of XPaths that do not exactly match but are subsumed by at
least one of the generalized XPaths of the other document. The term n is the number of
XPaths in $d_i$, and m is the number of XPaths in $d_j$. To compute s, the function subsume
described above with reference to Table 1 is used.

The subsume and generalize functions, given in Tables 1 and 2 respectively, incorporate 
the issue of repetition in the bag of XPaths model. This model can accommodate other 
aspects, such as optional elements and recursive elements, by using other subsume and 
generalize functions. If the application at hand requires optional and recursive control 
structures to be incorporated in the similarity measure, suitable subsume and generalize 
functions for these control structures can be used. In other words, the functions subsume 
and generalize for specific control structures can be designed based on the application at 
hand. 

Overview of procedure 

Fig. 4 is a flowchart 400 that represents, in overview, steps in a simple example of
comparing two documents. The flowchart 400 describes the procedure for obtaining the
similarity between two documents $d_i$ and $d_j$ in a given document collection. These steps
are as outlined below.

Step 410    Model all the documents as labeled trees and build a dictionary of paths
            or XPaths based on the bag of tree paths or bag of XPaths model as
            required. Let $Dict = \{p_1, p_2, \ldots, p_N\}$ be the dictionary of size N.

Step 420    Represent each document $d_j$ in the collection as an N-dimensional
            vector $[d_{j1}, d_{j2}, \ldots, d_{jN}]$, where element i of the vector, that
            is, $d_{ji}$, denotes the value of some feature associated with path $p_i$,
            such as the presence or absence of path $p_i$, or the frequency of
            occurrence of path $p_i$ in the document.

Step 430    Use the similarity measure given in Equation [1] or Equation [2] to
            obtain the similarity value between two documents based on the bag of
            tree paths model or bag of XPaths model.

Implementation using computer hardware and software 

Fig. 5 is a schematic representation of a computer system 500 that can be used to 
implement the techniques described herein. Computer software executes under a suitable 
operating system installed on the computer system 500 to assist in performing the 
described techniques. This computer software is programmed using any suitable computer 
programming language, and may be thought of as comprising various software code 
means for achieving particular steps. 

The components of the computer system 500 include a computer 520, a keyboard 510 and 
mouse 515, and a video display 590. The computer 520 includes a processor 540, a 
memory 550, input/output (I/O) interfaces 560, 565, a video interface 545, and a storage 
device 555. 

The processor 540 is a central processing unit (CPU) that executes the operating system 
and the computer software executing under the operating system. The memory 550 
includes random access memory (RAM) and read-only memory (ROM), and is used 
under direction of the processor 540. 

The video interface 545 is connected to video display 590 and provides video signals for 
display on the video display 590. User input to operate the computer 520 is provided from 
the keyboard 510 and mouse 515. The storage device 555 can include a disk drive or any
other suitable storage medium.

Each of the components of the computer 520 is connected to an internal bus 530 that
includes data, address, and control buses, to allow components of the computer 520 to
communicate with each other via the bus 530.

The computer system 500 can be connected to one or more other similar computers via an
input/output (I/O) interface 565 using a communication channel 585 to a network,
represented as the Internet 580.

The computer software may be recorded on a portable storage medium, in which case, the 
computer software program is accessed by the computer system 500 from the storage 
device 555. Alternatively, the computer software can be accessed directly from the 
Internet 580 by the computer 520. In either case, a user can interact with the computer 
system 500 using the keyboard 510 and mouse 515 to operate the programmed computer 
software executing on the computer 520. 

Other configurations or types of computer systems can be equally well used to implement 
the described techniques. The computer system 500 described above is described only as 
an example of a particular type of system suitable for implementing the described 
techniques. 

Example applications of document similarity measures 
Application to information extraction 

There has been much work in the area of information extraction from Web pages in
recent years. One approach to this problem is to generate a wrapper using an example
page. The fields that contain the desired information are indicated by the user. A wrapper
is created to capture the extraction rules based on the patterns exhibited by the indicated
fields in the example page. The wrapper is then used to extract similar information from
all the pages that are structurally similar to the given example page. This approach,
however, requires that all structurally similar pages are first identified and grouped. The
proposed similarity measure can be used to cluster pages based on structural similarity.

Application to Document Type Definition (DTD) induction

In the case of XML documents, knowledge of the DTD can facilitate the identification of 
structurally similar pages. Unfortunately, most of the XML documents on the Web are 
found without their DTDs. There are several induction algorithms that attempt to learn the 
DTD from a set of examples. These approaches assume that all the examples come from 
the same DTD. If the XML pages in a given collection come from different DTDs, these 
algorithms cannot be used directly, since it is theoretically infeasible to learn a single 
DTD for the entire collection. A possible solution is to partition the collection into smaller 
sets of "structurally similar" documents, and then learn the DTD for each set. Again, one 
can use the proposed similarity measure to cluster the pages based on "structural 
similarity". 

Application to template removal 

Common information contained in templates hinders the performance of many 
information retrieval and data mining algorithms. The common information is sometimes 
referred to as template information. Using the similarity measure described herein, one 
can determine an approach to identify this template information. The documents from a 
collection are first clustered based on their structure. A cluster contains pages that share a 
common look and feel. The text that appears at the same location in different pages within 
a cluster is identified as the common information. 

Conclusion 

A method, computer software, and a computer system are each described herein in the 
context of document structure comparison. Various alterations and modifications can be 
made to the techniques and arrangements described herein, as would be apparent to one 
skilled in the relevant art.